Take-home Exercise 3b: Predicting HDB Resale Prices with Geographically Weighted Machine Learning Methods

Author

Luo Yuming

Published

October 24, 2024

Modified

November 15, 2024


1. Overview

This exercise focuses on predicting HDB resale prices in Singapore using geographically weighted machine learning methods. By incorporating spatial components, we aim to understand regional patterns and improve prediction accuracy.


2. Objectives

The goals of this exercise are to:

  1. Apply geographically weighted machine learning models to predict HDB resale prices.
  2. Assess the performance of different models, including random forest and geographically weighted random forest (GWRF).
  3. Visualize and interpret spatial variations in predicted prices.

3. Getting Started

3.1 Required Packages

In this exercise, we will use the following packages:

Package Description
sf Provides functions for reading, processing, and visualizing spatial data in the “Simple Features” format, enabling spatial data handling in R.
spdep Provides tools for spatial dependency modeling, including spatial weights and measures for spatial autocorrelation, such as Moran’s I, useful for detecting spatial patterns.
tidyverse A suite of R packages designed for data manipulation (dplyr, tidyr), visualization (ggplot2), and other common data science tasks, improving data handling and analysis.
tmap A flexible package for creating static and interactive maps, allowing cartographic-quality visualizations of spatial data.
GWmodel Contains functions for Geographically Weighted Regression (GWR) and other spatially weighted models, allowing local modeling of spatial data where relationships can vary across geographic space.
caret A comprehensive package for machine learning, providing tools for model training, tuning, and evaluation, supporting methods like cross-validation and hyperparameter tuning.
ranger An efficient implementation of the random forest algorithm optimized for large datasets, used for predictive modeling and capable of handling both classification and regression tasks.
httr Facilitates HTTP requests in R, useful for connecting to APIs like OneMap to retrieve geographic coordinates based on addresses.
jsonlite A package for working with JSON data in R, enabling easy conversion of JSON data from web APIs into R data frames.
# Load necessary packages
pacman::p_load(tidyverse, sf, spdep, GWmodel, tmap, caret, ranger, httr, jsonlite, ggplot2, ggpubr, ggstatsplot, units, olsrr, gtsummary, SpatialML,Metrics)

3.2 The Data

Dataset Name Description Format Source
Master Plan 2019 Subzone Geospatial data representing the boundaries of different regions in Singapore. Shapefile Singapore Land Authority
HDB Resale Transactions Aspatial data containing information on HDB resale prices, house type and location. CSV Singapore Open Data Portal
Various Facilities Data Includes data on the location of elderly care centres, supermarkets, hawker centres, parks, etc. for analysing the impact of proximity on price. GeoJSON Various open sources in Singapore
MRT/Bus Stations Data on the location of MRT and Bus stations in Singapore is used to calculate the impact of commuting convenience on house prices. Shapefile Singapore Land Transport Authority

4. Data Preprocessing

4.1 Loading and Filtering Resale Data

4.1.1 Geospatial Data

mpsz <- st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_WEB_PL") %>%
  st_transform(3414)
Reading layer `MP14_SUBZONE_WEB_PL' from data source 
  `/Users/lucasluo/Desktop/SMU/Courses/Term3 Aug-Dec/ISSS626-Geospatial Analytics and Applications/lucasluo6/ISSS626/Take-home_Ex/Take-home_Ex03/data/geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21
bus_stops <- st_read(dsn = "data/geospatial", layer = "BusStop") %>%
  st_transform(crs = 3414)
Reading layer `BusStop' from data source 
  `/Users/lucasluo/Desktop/SMU/Courses/Term3 Aug-Dec/ISSS626-Geospatial Analytics and Applications/lucasluo6/ISSS626/Take-home_Ex/Take-home_Ex03/data/geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 5166 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48285.52 ymax: 52983.82
Projected CRS: SVY21
train_stations <- st_read(dsn = "data/geospatial", layer = "RapidTransitSystemStation") %>%
  st_transform(crs = 3414) %>%
  filter(grepl("STATION|MRT", STN_NAM_DE))
Reading layer `RapidTransitSystemStation' from data source 
  `/Users/lucasluo/Desktop/SMU/Courses/Term3 Aug-Dec/ISSS626-Geospatial Analytics and Applications/lucasluo6/ISSS626/Take-home_Ex/Take-home_Ex03/data/geospatial' 
  using driver `ESRI Shapefile'
Warning in CPL_read_ogr(dsn, layer, query, as.character(options), quiet, : GDAL
Message 1: Non closed ring detected. To avoid accepting it, set the
OGR_GEOMETRY_ACCEPT_UNCLOSED_RING configuration option to NO
Simple feature collection with 231 features and 7 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 6068.209 ymin: 27478.44 xmax: 45377.5 ymax: 47913.58
Projected CRS: SVY21
hawker_center <- st_read("data/geospatial/HawkerCentresGEOJSON.geojson") %>%
  st_transform(crs = 3414)
Reading layer `HawkerCentresGEOJSON' from data source 
  `/Users/lucasluo/Desktop/SMU/Courses/Term3 Aug-Dec/ISSS626-Geospatial Analytics and Applications/lucasluo6/ISSS626/Take-home_Ex/Take-home_Ex03/data/geospatial/HawkerCentresGEOJSON.geojson' 
  using driver `GeoJSON'
Simple feature collection with 125 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6974 ymin: 1.272716 xmax: 103.9882 ymax: 1.449017
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
clinics <- st_read("data/geospatial/CHASClinics.geojson") %>%
  st_transform(crs = 3414)
Reading layer `CHASClinics' from data source 
  `/Users/lucasluo/Desktop/SMU/Courses/Term3 Aug-Dec/ISSS626-Geospatial Analytics and Applications/lucasluo6/ISSS626/Take-home_Ex/Take-home_Ex03/data/geospatial/CHASClinics.geojson' 
  using driver `GeoJSON'
Simple feature collection with 1193 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.5818 ymin: 1.016264 xmax: 103.9903 ymax: 1.456037
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
supermarkets <- st_read("data/geospatial/SupermarketsGEOJSON.geojson") %>%
  st_transform(crs = 3414)
Reading layer `SupermarketsGEOJSON' from data source 
  `/Users/lucasluo/Desktop/SMU/Courses/Term3 Aug-Dec/ISSS626-Geospatial Analytics and Applications/lucasluo6/ISSS626/Take-home_Ex/Take-home_Ex03/data/geospatial/SupermarketsGEOJSON.geojson' 
  using driver `GeoJSON'
Simple feature collection with 526 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6258 ymin: 1.24715 xmax: 104.0036 ymax: 1.461526
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
schoolzone <- st_read("data/geospatial/LTASchoolZone.geojson") %>%
  st_transform(crs = 3414)
Reading layer `LTASchoolZone' from data source 
  `/Users/lucasluo/Desktop/SMU/Courses/Term3 Aug-Dec/ISSS626-Geospatial Analytics and Applications/lucasluo6/ISSS626/Take-home_Ex/Take-home_Ex03/data/geospatial/LTASchoolZone.geojson' 
  using driver `GeoJSON'
Simple feature collection with 211 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY, XYZ
Bounding box:  xmin: 103.687 ymin: 1.272736 xmax: 103.9668 ymax: 1.457587
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
Warning in CPL_transform(x, crs, aoi, pipeline, reverse, desired_accuracy, :
GDAL Message 1: Sub-geometry 0 has coordinate dimension 2, but container has 3
Warning in CPL_transform(x, crs, aoi, pipeline, reverse, desired_accuracy, :
GDAL Message 1: Sub-geometry 1 has coordinate dimension 2, but container has 3

4.1.2 Aspatial Data

resale <- read_csv("data/aspatial/resale.csv") %>%
  filter(month >= "2023-01" & month <= "2024-09")
Rows: 192613 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): month, town, flat_type, block, street_name, storey_range, flat_mode...
dbl (3): floor_area_sqm, lease_commence_date, resale_price

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
get_coords <- function(add_list){
  
  # Create a data frame to store all retrieved coordinates
  postal_coords <- data.frame()
    
  for (i in add_list){
    #print(i)

    r <- GET('https://www.onemap.gov.sg/api/common/elastic/search?',
           query=list(searchVal=i,
                     returnGeom='Y',
                     getAddrDetails='Y'))
    data <- fromJSON(rawToChar(r$content))
    found <- data$found
    res <- data$results
    
    # Create a new data frame for each address
    new_row <- data.frame()
    
    # If single result, append 
    if (found == 1){
      postal <- res$POSTAL 
      lat <- res$LATITUDE
      lng <- res$LONGITUDE
      new_row <- data.frame(address= i, 
                            postal = postal, 
                            latitude = lat, 
                            longitude = lng)
    }
    
    # If multiple results, drop NIL and append top 1
    else if (found > 1){
      # Remove those with NIL as postal
      res_sub <- res[res$POSTAL != "NIL", ]
      
      # Set as NA first if no Postal
      if (nrow(res_sub) == 0) {
          new_row <- data.frame(address= i, 
                                postal = NA, 
                                latitude = NA, 
                                longitude = NA)
      }
      
      else{
        top1 <- head(res_sub, n = 1)
        postal <- top1$POSTAL 
        lat <- top1$LATITUDE
        lng <- top1$LONGITUDE
        new_row <- data.frame(address= i, 
                              postal = postal, 
                              latitude = lat, 
                              longitude = lng)
      }
    }

    else {
      new_row <- data.frame(address= i, 
                            postal = NA, 
                            latitude = NA, 
                            longitude = NA)
    }
    
    # Add the row
    postal_coords <- rbind(postal_coords, new_row)
  }
  return(postal_coords)
}

4.2 Z-Dimensions Removal

We’ll use the st_zm() function to remove the z-dimension from the data.

hawker_center <- st_zm(hawker_center)
clinics <- st_zm(clinics)
supermarkets <- st_zm(supermarkets)
schoolzone<- st_zm(schoolzone)

4.3 Duel with invalid geometries

# Check for invalid geometries
datasets <- list(mpsz = mpsz, bus_stops = bus_stops, train_stations = train_stations, 
                 hawker_center = hawker_center, clinics = clinics, 
                 supermarkets = supermarkets, schoolzone = schoolzone)

for (name in names(datasets)) {
  invalid_count <- sum(!st_is_valid(datasets[[name]]))
  cat("Number of invalid geometries in", name, ":", invalid_count, "\n")
}
Number of invalid geometries in mpsz : 9 
Number of invalid geometries in bus_stops : 0 
Number of invalid geometries in train_stations : 2 
Number of invalid geometries in hawker_center : 0 
Number of invalid geometries in clinics : 0 
Number of invalid geometries in supermarkets : 0 
Number of invalid geometries in schoolzone : 0 
# Repair invalid geometries
for (name in names(datasets)) {
  if (sum(!st_is_valid(datasets[[name]])) > 0) {
    datasets[[name]] <- st_make_valid(datasets[[name]])
    cat("Repaired invalid geometries in", name, "\n")
  }
  
  # Check again to confirm all geometries are now valid
  invalid_count <- sum(!st_is_valid(datasets[[name]]))
  cat("After repair, number of invalid geometries in", name, ":", invalid_count, "\n")
}
Repaired invalid geometries in mpsz 
After repair, number of invalid geometries in mpsz : 0 
After repair, number of invalid geometries in bus_stops : 0 
Repaired invalid geometries in train_stations 
After repair, number of invalid geometries in train_stations : 0 
After repair, number of invalid geometries in hawker_center : 0 
After repair, number of invalid geometries in clinics : 0 
After repair, number of invalid geometries in supermarkets : 0 
After repair, number of invalid geometries in schoolzone : 0 

4.4 Data Transformation

resale_tidy <- resale %>%
  mutate(address = paste(block,street_name)) %>%
  mutate(remaining_lease_yr = as.integer(
    str_sub(remaining_lease, 0, 2)))%>%
  mutate(remaining_lease_mth = as.integer(
    str_sub(remaining_lease, 9, 11)))
resale_selected <- resale_tidy %>%
  filter(month == "2024-09")
add_list <- sort(unique(resale_selected$address))

4.5 Saving the Geocoded Coordinates

coords <- get_coords(add_list)
write_rds(coords, "data/rds/coords.rds")

4.6 Data Wrangling and Joining Coordinates

# Join coordinates with filtered resale data
resale_geo <- resale_selected %>%
  left_join(coords, by = "address") %>%
  filter(!is.na(latitude) & !is.na(longitude))%>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(crs = 3414)

4.7 Data Data Visualization

# Enable automatic fixing of invalid polygons
tmap_options(check.and.fix = TRUE)

# Switch to interactive view mode
tmap_mode("view")
tmap mode set to interactive viewing
# Interactive map with mpsz as base
tm_shape(mpsz) +
  tm_polygons(col = "white", border.col = "grey") +
  
  # Add layers interactively
  tm_shape(resale_geo) +
  tm_symbols(col = "yellow", size = 0.1, shape = 19) +
   
  tm_shape(clinics) +
  tm_symbols(col = "blue", size = 0.1, shape = 19) +
  
  tm_shape(hawker_center) +
  tm_symbols(col = "red", size = 0.15, shape = 21) +
  
  tm_shape(train_stations) +
  tm_symbols(col = "green", size = 0.1, shape = 19) +
  
  tm_shape(bus_stops) +
  tm_symbols(col = "grey", size = 0.05, shape = 20) +

  tm_shape(supermarkets) +
  tm_symbols(col = "brown", size = 0.1, shape = 19) +
  
  tm_layout(title = "Interactive Map of Amenities in Singapore")
Warning: The shape mpsz is invalid (after reprojection). See sf::st_is_valid
Warning: The shape train_stations is invalid (after reprojection). See
sf::st_is_valid

4.7.1 Filtering Data to Singapore Boundaries

# Ensure bus_stops are within the mpsz boundary
bus_stops <- st_intersection(bus_stops, mpsz)
Warning: attribute variables are assumed to be spatially constant throughout
all geometries
clinics <- st_intersection(clinics, mpsz)
Warning: attribute variables are assumed to be spatially constant throughout
all geometries
# Visualize the filtered data to confirm that all amenities are within Singapore
tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(mpsz) +
  tm_polygons(col = "white", border.col = "grey") +
  tm_shape(resale_geo) +
  tm_symbols(col = "yellow", size = 0.1, shape = 19) +
  tm_shape(clinics) +
  tm_symbols(col = "blue", size = 0.1, shape = 19) +
  tm_shape(hawker_center) +
  tm_symbols(col = "red", size = 0.15, shape = 21) +
  tm_shape(train_stations) +
  tm_symbols(col = "green", size = 0.1, shape = 19) +
  tm_shape(bus_stops) +  # Filtered bus stops
  tm_symbols(col = "grey", size = 0.05, shape = 20) +
  tm_shape(supermarkets) +
  tm_symbols(col = "brown", size = 0.1, shape = 19) +
  tm_layout(title = "Filtered Interactive Map of Amenities in Singapore")
Warning: The shape mpsz is invalid (after reprojection). See sf::st_is_valid
Warning: The shape train_stations is invalid (after reprojection). See
sf::st_is_valid

5. Data Wrangling

5.1. Structural Factors

Flat Type:

# Filter `resale_geo` into separate data frames based on flat type
resale_geo_3room <- resale_geo %>% filter(flat_type == "3 ROOM")
resale_geo_4room <- resale_geo %>% filter(flat_type == "4 ROOM")
resale_geo_5room <- resale_geo %>% filter(flat_type == "5 ROOM")

Area of the Unit: Check if you have a floor_area_sqm column.

# Convert area to numeric if necessary and handle missing values
resale_geo_4room <- resale_geo_4room %>%
  mutate(floor_area_sqm = as.numeric(floor_area_sqm)) %>%
  drop_na(floor_area_sqm)

Floor Level: If floor information is represented in ranges (e.g., “01 TO 03”), convert it to an approximate floor level (e.g., the median floor of the range). Create a new column, floor_level, to store the converted floor level.

# Convert `storey_range` to approximate floor level
resale_geo_4room  <- resale_geo_4room  %>%
  mutate(floor_level = (as.numeric(str_extract(storey_range, "^[0-9]+")) +
                     as.numeric(str_extract(storey_range, "[0-9]+$"))) / 2)

Age of the Unit:

# Calculate age of the unit
current_year <- as.numeric(format(Sys.Date(), "%Y"))

resale_geo_4room <- resale_geo_4room %>%
  mutate(age = current_year - lease_commence_date)

5.2 Locational Factors

Proximity to Key Amenities Calculate the distance from each HDB flat to various key amenities like MRT, parks, and shopping malls. Use st_distance to find the nearest amenity in each category.

# Function to calculate proximity to the nearest amenity with units in meters
calculate_nearest <- function(resale_data, amenity_data, amenity_name) {
  # Calculate distances and get the minimum for each resale point
  nearest_distances <- st_distance(resale_data, amenity_data) %>%
    apply(1, min)
  
  # Set units to meters if not already
  nearest_distances <- set_units(nearest_distances, "m", mode = "standard")
  
  # Add the distances as a new column in the resale data
  resale_data[[amenity_name]] <- nearest_distances
  return(resale_data)
}

# Calculate distances to amenities for resale_geo_4room
resale_geo_4room <- calculate_nearest(resale_geo_4room, train_stations, "dist_to_mrt")
resale_geo_4room <- calculate_nearest(resale_geo_4room, bus_stops, "dist_to_bus")
resale_geo_4room <- calculate_nearest(resale_geo_4room, supermarkets, "dist_to_supermarket")

Count of Facilities within Radius Use the calculate_amenity_count function to calculate the number of facilities within a specific radius of each flat. For each amenity type, specify a radius and create a new column indicating the count of nearby facilities.

# Function to calculate the count of amenities within a specified radius
calculate_amenity_count <- function(resale_data, amenity_data, radius, amenity_name) {
  # Calculate if each resale point has any amenities within the radius
  nearby_count <- st_is_within_distance(resale_data, amenity_data, dist = radius) %>% lengths()
  
  # Print summary for debugging
  print(paste("Summary for", amenity_name, "within", radius, "meters:"))
  print(summary(nearby_count))
  
  # Add count as a new column in resale_data
  resale_data <- resale_data %>%
    mutate(!!paste0("count_", amenity_name) := nearby_count)
  
  return(resale_data)
}

# Calculate counts for each amenity type within a 500-meter radius
resale_geo_4room <- calculate_amenity_count(resale_geo_4room, clinics, 500, "clinics")
[1] "Summary for clinics within 500 meters:"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   4.000   5.000   6.012   8.000  21.000 
resale_geo_4room <- calculate_amenity_count(resale_geo_4room, hawker_center, 500, "hawker_center")
[1] "Summary for hawker_center within 500 meters:"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.5335  1.0000  4.0000 
resale_geo_4room<- calculate_amenity_count(resale_geo_4room, schoolzone, 300, "schoolzone")
[1] "Summary for schoolzone within 300 meters:"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  1.0000  0.6847  1.0000  3.0000 

Filtering out unnecessary columns and saving the final dataset to a file is a good practice. This will help streamline analysis and avoid redundant calculations.

# Rename columns for resale_geo_4room (similar steps can be applied to other datasets)
resale_new_4room <- resale_geo_4room %>%
  rename(
    MONTH = month,
    TOWN = town,
    FLOOR_AREA_SQM = floor_area_sqm,
    ADDRESS = address,
    FLAT_MODEL = flat_model,
    RESALE_PRICE = resale_price,
    FLOOR_LEVEL = floor_level,
    REMAINING_LEASE_YEARS = remaining_lease_yr,
    UNIT_AGE = age,
    PROX_MRT = dist_to_mrt,
    PROX_BUS_STOP = dist_to_bus,
    PROX_SUPERMARKET = dist_to_supermarket
  ) %>%
  select(
    MONTH, TOWN, FLOOR_AREA_SQM, ADDRESS, FLAT_MODEL, RESALE_PRICE, FLOOR_LEVEL, REMAINING_LEASE_YEARS, UNIT_AGE,
    PROX_MRT, PROX_BUS_STOP, PROX_SUPERMARKET, count_clinics, count_hawker_center, count_schoolzone
  )

6.Exploratory Data Analysis (EDA)

In this section, we’ll use statistical graphics functions from packages like ggplot2 and tmap to conduct exploratory data analysis (EDA) of HDB resale data. This analysis will help us understand the feature distributions and spatial patterns of the data.

6.1 Analyzing HDB Resale Price Distribution

We start by plotting the distribution of resale prices to observe skewness or outliers in the data. The code below creates an initial histogram of resale price distribution:

# Histogram showing HDB resale price distribution
ggplot(data = resale_new_4room, aes(x = RESALE_PRICE)) +
  geom_histogram(bins = 20, color = "black", fill = "light blue") +
  labs(title = "Distribution of HDB Resale Prices", x = "Resale Price", y = "Frequency")

tm_shape(mpsz)+
  tm_polygons(col = "white") +
tm_shape(resale_new_4room) +  
  tm_dots(col = "RESALE_PRICE",
          alpha = 0.6,
          style="quantile")
Warning: The shape mpsz is invalid (after reprojection). See sf::st_is_valid

6.3 Exploring Structural Factors

Investigate structural factors such as floor area, age, remaining lease, and floor level to see how they correlate with resale price.

# Scatter plot for Resale Price vs Floor Area
ggplot(resale_new_4room, aes(x = FLOOR_AREA_SQM, y = RESALE_PRICE)) +
  geom_point(alpha = 0.5, color = "darkblue") +
  labs(title = "Resale Price vs Floor Area", x = "Floor Area (sqm)", y = "Resale Price") +
  theme_minimal()

# Scatter plot for Resale Price vs Age of Flat
ggplot(resale_new_4room, aes(x = UNIT_AGE, y = RESALE_PRICE)) +
  geom_point(alpha = 0.5, color = "darkgreen") +
  labs(title = "Resale Price vs Age of Flat", x = "Age of Flat (years)", y = "Resale Price") +
  theme_minimal()

# Scatter plot for Resale Price vs Age of Flat
ggplot(resale_new_4room, aes(x = FLOOR_LEVEL, y = RESALE_PRICE)) +
  geom_point(alpha = 0.5, color = "darkred") +
  labs(title = "Resale Price vs Age of Flat", x = "Age of Flat (years)", y = "Resale Price") +
  theme_minimal()

6.4 Scatter Plots of Resale Price vs. Distance to Amenities

# Scatter plot for Resale Price vs Distance to MRT
ggplot(resale_new_4room, aes(x = PROX_MRT, y = RESALE_PRICE)) +
  geom_point(alpha = 0.5, color = "purple") +
  labs(title = "Resale Price vs Distance to MRT", 
       x = "Distance to MRT (meters)", y = "Resale Price") +
  theme_minimal()
Warning: The `scale_name` argument of `continuous_scale()` is deprecated as of ggplot2
3.5.0.

# Scatter plot for Resale Price vs Distance to Bus Stop
ggplot(resale_new_4room, aes(x = PROX_BUS_STOP, y = RESALE_PRICE)) +
  geom_point(alpha = 0.5, color = "green") +
  labs(title = "Resale Price vs Distance to Bus Stop", 
       x = "Distance to Bus Stop (meters)", y = "Resale Price") +
  theme_minimal()

# Scatter plot for Resale Price vs Distance to Supermarket
ggplot(resale_new_4room, aes(x = PROX_SUPERMARKET, y = RESALE_PRICE)) +
  geom_point(alpha = 0.5, color = "orange") +
  labs(title = "Resale Price vs Distance to Supermarket", 
       x = "Distance to Supermarket (meters)", y = "Resale Price") +
  theme_minimal()

### 6.5 Count of Amenities within Radius

# Scatter plot for Resale Price vs Count of Nearby Clinics
ggplot(resale_new_4room, aes(x = count_clinics, y = RESALE_PRICE)) +
  geom_jitter(width = 0.2, color = "blue", alpha = 0.5) +
  labs(title = "Resale Price vs Number of Nearby Clinics", 
       x = "Count of Nearby Clinics", y = "Resale Price") +
  theme_minimal()

# Scatter plot for Resale Price vs Count of Nearby Hawker Centers
ggplot(resale_new_4room, aes(x = count_hawker_center, y = RESALE_PRICE)) +
  geom_jitter(width = 0.2, color = "red", alpha = 0.5) +
  labs(title = "Resale Price vs Number of Nearby Hawker Centers", 
       x = "Count of Nearby Hawker Centers", y = "Resale Price") +
  theme_minimal()

# Scatter plot for Resale Price vs Count of Nearby Hawker Centers
ggplot(resale_new_4room, aes(x = count_schoolzone, y = RESALE_PRICE)) +
  geom_jitter(width = 0.2, color = "green", alpha = 0.5) +
  labs(title = "Resale Price vs Number of Nearby Schoolzone", 
       x = "Count of Nearby School zone", y = "Resale Price") +
  theme_minimal()

6.5Correlation Analysis

resale_new_4room_nogeo <- resale_new_4room %>%
  st_drop_geometry()
ggcorrmat(resale_new_4room_nogeo, names(resale_new_4room_nogeo))

7 Building a non-spatial multiple linear regression

7.1 Prepare Data for Regression

Remove REMAINING_LEASE_YEARS and ensure all relevant variables are formatted correctly. You might also need to convert categorical variables, such as TOWN and FLAT_MODEL, into dummy variables or factors if they’re not already.

# Remove the `REMAINING_LEASE_YEARS` column
resale_new_4room <- resale_new_4room %>% select(-REMAINING_LEASE_YEARS)

# Convert categorical variables to factors
resale_new_4room$TOWN <- as.factor(resale_new_4room$TOWN)
resale_new_4room$FLAT_MODEL <- as.factor(resale_new_4room$FLAT_MODEL)


# Set a random seed for reproducibility
set.seed(123)

# Create training (80%) and testing (20%) data splits
train_index <- createDataPartition(resale_new_4room$RESALE_PRICE, p = 0.8, list = FALSE)
train_data <- resale_new_4room[train_index, ]
test_data <- resale_new_4room[-train_index, ]

7.2 Building the Model on the Training Data

# Build the model on the training set
train_mlr <- lm(RESALE_PRICE ~ FLOOR_AREA_SQM + UNIT_AGE+ FLOOR_LEVEL + PROX_MRT +PROX_BUS_STOP+ PROX_SUPERMARKET + count_clinics + count_hawker_center + count_schoolzone, data = train_data)
ols_regress(train_mlr)
                              Model Summary                                
--------------------------------------------------------------------------
R                           0.780       RMSE                    97612.564 
R-Squared                   0.608       MSE                9528212723.287 
Adj. R-Squared              0.603       Coef. Var                  15.117 
Pred R-Squared              0.597       AIC                     19177.027 
MAE                     76510.381       SBC                     19227.730 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                    ANOVA                                     
-----------------------------------------------------------------------------
                    Sum of                                                   
                   Squares         DF       Mean Square       F         Sig. 
-----------------------------------------------------------------------------
Regression    1.096911e+13          9       1.21879e+12     126.19    0.0000 
Residual      7.069934e+12        732    9658379563.769                      
Total         1.803904e+13        741                                        
-----------------------------------------------------------------------------

                                             Parameter Estimates                                              
-------------------------------------------------------------------------------------------------------------
              model          Beta    Std. Error    Std. Beta       t        Sig          lower         upper 
-------------------------------------------------------------------------------------------------------------
        (Intercept)    302289.645     55615.316                   5.435    0.000    193105.097    411474.192 
     FLOOR_AREA_SQM      3500.832       575.730        0.152      6.081    0.000      2370.552      4631.111 
           UNIT_AGE     -4443.227       291.178       -0.396    -15.259    0.000     -5014.870     -3871.584 
        FLOOR_LEVEL      9279.482       567.665        0.412     16.347    0.000      8165.038     10393.927 
           PROX_MRT       -72.858         9.811       -0.178     -7.426    0.000       -92.119       -53.596 
      PROX_BUS_STOP       195.219        67.556        0.068      2.890    0.004        62.593       327.845 
   PROX_SUPERMARKET        -2.982        25.658       -0.003     -0.116    0.908       -53.354        47.390 
      count_clinics      6459.330      1273.603        0.132      5.072    0.000      3958.979      8959.680 
count_hawker_center     48602.317      5341.484        0.235      9.099    0.000     38115.862     59088.772 
   count_schoolzone    -31888.927      5094.498       -0.151     -6.259    0.000    -41890.498    -21887.357 
-------------------------------------------------------------------------------------------------------------
tbl_regression(train_mlr, intercept = TRUE)
Characteristic Beta 95% CI1 p-value
(Intercept) 302,290 193,105, 411,474 <0.001
FLOOR_AREA_SQM 3,501 2,371, 4,631 <0.001
UNIT_AGE -4,443 -5,015, -3,872 <0.001
FLOOR_LEVEL 9,279 8,165, 10,394 <0.001
PROX_MRT -73 -92, -54 <0.001
PROX_BUS_STOP 195 63, 328 0.004
PROX_SUPERMARKET -3.0 -53, 47 >0.9
count_clinics 6,459 3,959, 8,960 <0.001
count_hawker_center 48,602 38,116, 59,089 <0.001
count_schoolzone -31,889 -41,890, -21,887 <0.001
1 CI = Confidence Interval

Check for Multicollinearity

ols_vif_tol(train_mlr)
            Variables Tolerance      VIF
1      FLOOR_AREA_SQM 0.8599615 1.162843
2            UNIT_AGE 0.7969074 1.254851
3         FLOOR_LEVEL 0.8432154 1.185937
4            PROX_MRT 0.9329164 1.071907
5       PROX_BUS_STOP 0.9648531 1.036427
6    PROX_SUPERMARKET 0.9265095 1.079320
7       count_clinics 0.7962761 1.255846
8 count_hawker_center 0.8044852 1.243031
9    count_schoolzone 0.9201214 1.086813

7.3 Adjustments to Improve the Model

Since PROX_SUPERMARKET has a very high p-value, removing it could simplify the model without losing predictive power.

Additionally, some predictors, such as FLOOR_AREA_SQM or UNIT_AGE, could be transformed if their relationships with RESALE_PRICE are non-linear (e.g., log transformation for positive, skewed predictors).

# Re-fit the model without PROX_SUPERMARKET
train_data2 <- train_data %>%
    mutate(LOG_RESALE_PRICE = log(RESALE_PRICE))

train_mlr2 <- lm(LOG_RESALE_PRICE ~ FLOOR_AREA_SQM + UNIT_AGE + FLOOR_LEVEL + PROX_MRT + 
                PROX_BUS_STOP + count_clinics + count_hawker_center + count_schoolzone, 
                data = train_data2)
ols_regress(train_mlr2)
                         Model Summary                           
----------------------------------------------------------------
R                       0.786       RMSE                  0.134 
R-Squared               0.617       MSE                   0.018 
Adj. R-Squared          0.613       Coef. Var             1.008 
Pred R-Squared          0.608       AIC                -859.410 
MAE                     0.107       SBC                -813.316 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
               Sum of                                                
              Squares         DF    Mean Square       F         Sig. 
---------------------------------------------------------------------
Regression     21.403          8          2.675    147.665    0.0000 
Residual       13.280        733          0.018                      
Total          34.683        741                                     
---------------------------------------------------------------------

                                       Parameter Estimates                                        
-------------------------------------------------------------------------------------------------
              model      Beta    Std. Error    Std. Beta       t        Sig      lower     upper 
-------------------------------------------------------------------------------------------------
        (Intercept)    12.830         0.076                 169.904    0.000    12.681    12.978 
     FLOOR_AREA_SQM     0.006         0.001        0.178      7.235    0.000     0.004     0.007 
           UNIT_AGE    -0.007         0.000       -0.447    -17.582    0.000    -0.008    -0.006 
        FLOOR_LEVEL     0.012         0.001        0.378     15.203    0.000     0.010     0.013 
           PROX_MRT     0.000         0.000       -0.189     -7.981    0.000     0.000     0.000 
      PROX_BUS_STOP     0.000         0.000        0.065      2.804    0.005     0.000     0.000 
      count_clinics     0.009         0.002        0.132      5.221    0.000     0.006     0.012 
count_hawker_center     0.066         0.007        0.228      8.970    0.000     0.051     0.080 
   count_schoolzone    -0.045         0.007       -0.155     -6.504    0.000    -0.059    -0.032 
-------------------------------------------------------------------------------------------------

7.4 Test for Normality

ols_plot_resid_hist(train_mlr2)

7.5 Test for Spatial Autocorrelation

mlr_res <- as.data.frame(train_mlr2$residuals)
resale_res <- cbind(train_data2,
                    mlr_res) %>%
  rename(MLR_RES = train_mlr2.residuals)

tm_shape(mpsz)+
  tm_polygons(col = "white") +
tm_shape(resale_res) +  
  tm_dots(col = "MLR_RES",
          alpha = 0.6,
          style="quantile") +
  tm_layout(main.title = "Multiple Linear Regression Residuals",     
            main.title.position = "center",
            main.title.size = 1)
Warning: The shape mpsz is invalid (after reprojection). See sf::st_is_valid
Variable(s) "MLR_RES" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
tmap_mode("plot")
tmap mode set to plotting

The figure above reveal that there is sign of spatial autocorrelation.

To proof that our observation is indeed true, the Moran’s I test will be performed

resale_sp <- as_Spatial(resale_res)

nb <- dnearneigh(coordinates(resale_sp), 0, 2000, longlat = FALSE)

nb_lw <- nb2listw(nb, style = 'W')

lm.morantest(train_mlr2, nb_lw)

    Global Moran I for regression residuals

data:  
model: lm(formula = LOG_RESALE_PRICE ~ FLOOR_AREA_SQM + UNIT_AGE +
FLOOR_LEVEL + PROX_MRT + PROX_BUS_STOP + count_clinics +
count_hawker_center + count_schoolzone, data = train_data2)
weights: nb_lw

Moran I statistic standard deviate = 50.353, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Observed Moran I      Expectation         Variance 
    4.399105e-01    -3.350891e-03     7.749292e-05 

8.GWR & GWRF Predictive Method

8.1 Converting to SpatialPointsDataFrame

train_data_sp <- as_Spatial(train_data)

8.2 Computing Adaptive Bandwidth

To identify the optimal bandwidth, use cross-validation (CV) within the bw.gwr() function. This step helps in determining the appropriate number of neighbors for each local regression.

bw_adaptive <- bw.gwr(RESALE_PRICE ~ FLOOR_AREA_SQM + UNIT_AGE+ FLOOR_LEVEL + PROX_MRT +PROX_BUS_STOP+ PROX_SUPERMARKET + count_clinics + count_hawker_center + count_schoolzone,
                      data=train_data_sp,
                      approach="CV",
                      kernel="gaussian",
                      adaptive=TRUE,
                      longlat=FALSE)
Adaptive bandwidth: 466 CV score: 6.46379e+12 
Adaptive bandwidth: 296 CV score: 5.894423e+12 
Adaptive bandwidth: 190 CV score: 5.159191e+12 
Adaptive bandwidth: 125 CV score: 4.260186e+12 
Adaptive bandwidth: 84 CV score: 3.483417e+12 
Adaptive bandwidth: 59 CV score: 2.624868e+12 
Adaptive bandwidth: 43 CV score: 2.08822e+12 
Adaptive bandwidth: 34 CV score: 1.906203e+12 
Adaptive bandwidth: 27 CV score: 1.772778e+12 
Adaptive bandwidth: 24 CV score: 1.761192e+12 
Adaptive bandwidth: 21 CV score: 1.702006e+12 
Adaptive bandwidth: 20 CV score: 1.690141e+12 
Adaptive bandwidth: 18 CV score: 1.642465e+12 
Adaptive bandwidth: 18 CV score: 1.642465e+12 
# Save bandwidth result for reuse
write_rds(bw_adaptive, "data/rds/bw_adaptive.rds")

8.3 Constructing the GWR Model

Using the saved bandwidth, fit the GWR model with an adaptive Gaussian kernel.

gwr_adaptive <- gwr.basic(RESALE_PRICE ~ FLOOR_AREA_SQM + UNIT_AGE+ FLOOR_LEVEL + PROX_MRT +PROX_BUS_STOP+ PROX_SUPERMARKET + count_clinics + count_hawker_center + count_schoolzone,
                          data=train_data_sp,
                          bw=bw_adaptive, 
                          kernel = 'gaussian', 
                          adaptive=TRUE,
                          longlat = FALSE)

# Save the GWR model for future use
write_rds(gwr_adaptive, "data/rds/gwr_adaptive.rds")

8.4 Retrieving GWR Output

Load the saved model and display the output to examine coefficient values and other diagnostics.

gwr_adaptive <- read_rds("data/rds/gwr_adaptive.rds")
print(gwr_adaptive)
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2024-11-15 04:32:06.559689 
   Call:
   gwr.basic(formula = RESALE_PRICE ~ FLOOR_AREA_SQM + UNIT_AGE + 
    FLOOR_LEVEL + PROX_MRT + PROX_BUS_STOP + PROX_SUPERMARKET + 
    count_clinics + count_hawker_center + count_schoolzone, data = train_data_sp, 
    bw = bw_adaptive, kernel = "gaussian", adaptive = TRUE, longlat = FALSE)

   Dependent (y) variable:  RESALE_PRICE
   Independent variables:  FLOOR_AREA_SQM UNIT_AGE FLOOR_LEVEL PROX_MRT PROX_BUS_STOP PROX_SUPERMARKET count_clinics count_hawker_center count_schoolzone
   Number of data points: 742
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
    Min      1Q  Median      3Q     Max 
-227173  -64473  -17631   57356  406317 

   Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
   (Intercept)         302289.645  55615.316   5.435 7.45e-08 ***
   FLOOR_AREA_SQM        3500.832    575.730   6.081 1.93e-09 ***
   UNIT_AGE             -4443.227    291.178 -15.259  < 2e-16 ***
   FLOOR_LEVEL           9279.482    567.665  16.347  < 2e-16 ***
   PROX_MRT               -72.858      9.811  -7.426 3.12e-13 ***
   PROX_BUS_STOP          195.219     67.556   2.890  0.00397 ** 
   PROX_SUPERMARKET        -2.982     25.658  -0.116  0.90751    
   count_clinics         6459.330   1273.603   5.072 5.00e-07 ***
   count_hawker_center  48602.317   5341.484   9.099  < 2e-16 ***
   count_schoolzone    -31888.927   5094.498  -6.259 6.58e-10 ***

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
   Residual standard error: 98280 on 732 degrees of freedom
   Multiple R-squared: 0.6081
   Adjusted R-squared: 0.6033 
   F-statistic: 126.2 on 9 and 732 DF,  p-value: < 2.2e-16 
   ***Extra Diagnostic information
   Residual sum of squares: 7.069934e+12
   Sigma(hat): 97744.38
   AIC:  19177.03
   AICc:  19177.39
   BIC:  18558.43
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Adaptive bandwidth: 18 (number of nearest neighbours)
   Regression points: the same locations as observations are used.
   Distance metric: Euclidean distance metric is used.

   ****************Summary of GWR coefficient estimates:******************
                              Min.     1st Qu.      Median     3rd Qu.
   Intercept           -4.6538e+05  3.3412e+05  4.3010e+05  5.3961e+05
   FLOOR_AREA_SQM      -2.3396e+01  2.6054e+03  3.2933e+03  5.0034e+03
   UNIT_AGE            -1.0103e+04 -7.2183e+03 -4.9476e+03 -4.0259e+03
   FLOOR_LEVEL          3.6600e+02  3.5074e+03  5.0000e+03  6.1503e+03
   PROX_MRT            -2.9873e+02 -1.5011e+02 -9.6621e+01 -3.7140e+01
   PROX_BUS_STOP       -7.6902e+02 -5.6302e+01 -3.7344e+00  1.0510e+02
   PROX_SUPERMARKET    -3.1889e+02 -6.6561e+01 -4.8243e+00  4.1944e+01
   count_clinics       -1.4930e+04 -1.7624e+03  1.9888e+03  5.1396e+03
   count_hawker_center -5.7545e+04 -8.4636e+03  4.6069e+03  1.6099e+04
   count_schoolzone    -7.5350e+04 -1.2103e+04 -2.2751e+03  8.3975e+03
                             Max.
   Intercept           840576.850
   FLOOR_AREA_SQM       11576.366
   UNIT_AGE             -1055.267
   FLOOR_LEVEL          12466.447
   PROX_MRT                63.306
   PROX_BUS_STOP          539.328
   PROX_SUPERMARKET       194.674
   count_clinics        18448.559
   count_hawker_center 157597.607
   count_schoolzone     79652.802
   ************************Diagnostic information*************************
   Number of data points: 742 
   Effective number of parameters (2trace(S) - trace(S'S)): 271.5186 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 470.4814 
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 18183.9 
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 17782.3 
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 18257.58 
   Residual sum of squares: 829729534739 
   R-square value:  0.9540037 
   Adjusted R-square value:  0.9274023 

   ***********************************************************************
   Program stops at: 2024-11-15 04:32:06.674775 

8.5 Preparing Coordinates Data

Extract the x and y coordinates from the sf objects, which will be useful for spatial modeling.

coords_train <- st_coordinates(train_data)
coords_test <- st_coordinates(test_data)

write_rds(coords_train, "data/rds/coords_train.rds")
write_rds(coords_test, "data/rds/coords_test.rds")

train_data <- train_data %>% st_drop_geometry()

8.6 Calibrating Random Forest Model

Using the ranger package, fit a Random Forest model on train_data.

set.seed(1234)
rf <- ranger(RESALE_PRICE ~ FLOOR_AREA_SQM + UNIT_AGE+ FLOOR_LEVEL + PROX_MRT +PROX_BUS_STOP+ PROX_SUPERMARKET + count_clinics + count_hawker_center + count_schoolzone,
             data=train_data)

# Save the model
write_rds(rf, "data/rds/rf.rds")

8.7Calibrating Geographical Random Forest Model

Using the grf function in the SpatialML package

set.seed(1234)
gwRF_adaptive <- grf(formula = RESALE_PRICE ~ FLOOR_AREA_SQM + UNIT_AGE+ FLOOR_LEVEL + PROX_MRT +PROX_BUS_STOP+ PROX_SUPERMARKET + count_clinics + count_hawker_center + count_schoolzone,
                     dframe=train_data, 
                     bw=55,
                     kernel="adaptive",
                     coords=coords_train)

Number of Observations: 742
Number of Independent Variables: 9
Kernel: Adaptive
Neightbours: 55

--------------- Global ML Model Summary ---------------
Ranger result

Call:
 ranger(RESALE_PRICE ~ FLOOR_AREA_SQM + UNIT_AGE + FLOOR_LEVEL +      PROX_MRT + PROX_BUS_STOP + PROX_SUPERMARKET + count_clinics +      count_hawker_center + count_schoolzone, data = train_data,      num.trees = 500, mtry = 3, importance = "impurity", num.threads = NULL) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      742 
Number of independent variables:  9 
Mtry:                             3 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       6990351252 
R squared (OOB):                  0.7128534 

Importance:
     FLOOR_AREA_SQM            UNIT_AGE         FLOOR_LEVEL            PROX_MRT 
       1.405120e+12        4.230223e+12        4.405082e+12        1.741839e+12 
      PROX_BUS_STOP    PROX_SUPERMARKET       count_clinics count_hawker_center 
       1.183246e+12        1.057512e+12        1.302310e+12        1.654157e+12 
   count_schoolzone 
       4.153670e+11 

Mean Square Error (Not OOB): 1437947254.091
R-squared (Not OOB) %: 94.085
AIC (Not OOB): 15666.17
AICc (Not OOB): 15666.471

--------------- Local Model Summary ---------------

Residuals OOB:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-246290  -29308   -3210   -2081   23492  298915 

Residuals Predicted (Not OOB):
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-60678.5  -5377.1   -402.8   -138.7   4941.4  81329.7 

Local Variable Importance:
                           Min          Max         Mean          StD
FLOOR_AREA_SQM      7188555240 4.909814e+11  73922810814  85391993951
UNIT_AGE            9268444838 1.029969e+12 252150186902 324210542482
FLOOR_LEVEL         6986411093 9.273832e+11  99070468430 142186160540
PROX_MRT            4089299733 4.688317e+11  77971884997 100395411795
PROX_BUS_STOP       4928756987 3.182668e+11  38149218960  48704790235
PROX_SUPERMARKET    5019768869 2.120083e+11  46488733804  42627504945
count_clinics       2323146882 3.780597e+11  52819903249  76273846773
count_hawker_center          0 1.602772e+11  14999348883  21455875752
count_schoolzone     779705563 3.588960e+11  18693503937  45134421838

Mean squared error (OOB): 3254028652.754
R-squared (OOB) %: 86.615
AIC (OOB): 16272.144
AICc (OOB): 16272.445
Mean squared error Predicted (Not OOB): 133396112.001
R-squared Predicted (Not OOB) %: 99.451
AIC Predicted (Not OOB): 13901.954
AICc Predicted (Not OOB): 13902.255

Calculation time (in seconds): 6.7051
write_rds(gwRF_adaptive, "data/rds/gwRF_adaptive.rds")
variable_importance <- gwRF_adaptive$Global.Model$variable.importance
sort(variable_importance, decreasing = TRUE)
        FLOOR_LEVEL            UNIT_AGE            PROX_MRT count_hawker_center 
       4.405082e+12        4.230223e+12        1.741839e+12        1.654157e+12 
     FLOOR_AREA_SQM       count_clinics       PROX_BUS_STOP    PROX_SUPERMARKET 
       1.405120e+12        1.302310e+12        1.183246e+12        1.057512e+12 
   count_schoolzone 
       4.153670e+11 

9.Model Evaluation

For the Model Evaluation section, assess the performance of each of your three models: GWR (Geographically Weighted Regression), GWRF (Geographically Weighted Random Forest), and MLR (Multiple Linear Regression).

Model Evaluation Metrics: Root Mean Square Error (RMSE): Measures the average magnitude of error. A lower RMSE indicates better model accuracy. Mean Absolute Error (MAE): Shows the average absolute error, providing insight into the average deviation of predictions from actual values. R-squared (R²): Explains the proportion of variance in the dependent variable that is predictable from the independent variables. Adjusted R-squared: Adjusts R² for the number of predictors in the model, providing a more reliable metric when comparing models with different numbers of variables.

9.1 Prepare Test Data

test_data2 <- test_data %>%
    mutate(LOG_RESALE_PRICE = log(RESALE_PRICE))

coords_test <- st_coordinates(test_data)

resale_test_nogeo <- cbind(test_data, coords_test) %>%
  st_drop_geometry()

9.2 Multiple Linear Regression

test_data2$PREDICT_LOG_MLR <- predict(train_mlr2, newdata = test_data2)


test_data2$PREDICT_MLR <- exp(test_data2$PREDICT_LOG_MLR) 
test_data2$ACTUAL_RESALE_PRICE <- exp(test_data2$LOG_RESALE_PRICE)  


rmse_mlr <- rmse(test_data2$ACTUAL_RESALE_PRICE, test_data2$PREDICT_MLR)
mae_mlr <- mae(test_data2$ACTUAL_RESALE_PRICE, test_data2$PREDICT_MLR)


r_squared_mlr <- summary(train_mlr2)$r.squared
adj_r_squared_mlr <- summary(train_mlr2)$adj.r.squared


cat("MLR RMSE:", rmse_mlr, "MAE:", mae_mlr, "R_squared:", r_squared_mlr, "Adjusted R_squared:", adj_r_squared_mlr, "\n")
MLR RMSE: 92155.49 MAE: 66285.86 R_squared: 0.617096 Adjusted R_squared: 0.612917 

9.3 Geographically Weighted Random Forest (GWRF)

gwRF_pred <- predict.grf(
  gwRF_adaptive,               # the fitted GWRF model
  resale_test_nogeo,           # test data without geometry
  x.var.name = "X",            # column name for X coordinate
  y.var.name = "Y",            # column name for Y coordinate
  local.w = 1,                 # local weight
  global.w = 0                 # global weight
)

GRF_pred_df <- data.frame(GRF_pred = gwRF_pred)
rmse_gwrf <- rmse(resale_test_nogeo$RESALE_PRICE, GRF_pred_df$GRF_pred)
mae_gwrf <- mae(resale_test_nogeo$RESALE_PRICE, GRF_pred_df$GRF_pred)
ss_res <- sum((resale_test_nogeo$RESALE_PRICE - GRF_pred_df$GRF_pred)^2)
ss_tot <- sum((resale_test_nogeo$RESALE_PRICE - mean(resale_test_nogeo$RESALE_PRICE))^2)
r_squared_gwrf <- 1 - (ss_res / ss_tot)
n <- nrow(resale_test_nogeo)  # number of observations
p <- length(gwRF_adaptive$Global.Model$variable.importance)  # number of predictors
adj_r_squared_gwrf <- 1 - ((1 - r_squared_gwrf) * (n - 1) / (n - p - 1))

cat("GWRF RMSE:", rmse_gwrf, "MAE:", mae_gwrf, "R_squared:", r_squared_gwrf, "Adjusted R_squared:", adj_r_squared_gwrf, "\n")
GWRF RMSE: 58630.75 MAE: 39470.21 R_squared: 0.847518 Adjusted R_squared: 0.839631 

9.5 Evaluation_summary

evaluation_summary <- data.frame(
  Model = c("MLR" , "GWRF"),
  RMSE = c(rmse_mlr, rmse_gwrf),
  MAE = c(mae_mlr, mae_gwrf),
  R_Squared = c(r_squared_mlr, r_squared_gwrf),
  Adjusted_R_Squared = c(adj_r_squared_mlr, adj_r_squared_gwrf)
)

print(evaluation_summary)
  Model     RMSE      MAE R_Squared Adjusted_R_Squared
1   MLR 92155.49 66285.86  0.617096           0.612917
2  GWRF 58630.75 39470.21  0.847518           0.839631

A lower RMSE indicates that the GWRF model has less prediction error compared to MLR. This suggests that GWRF is more accurate in predicting the resale price.

Similar to RMSE, a lower MAE in GWRF shows that on average, its predictions are closer to the actual values than those of the MLR model.

R-squared measures the proportion of variance in the dependent variable that is predictable from the independent variables. An R-squared of 0.8475 for GWRF suggests it explains about 85% of the variance in resale prices, whereas MLR explains only about 62%.

Adjusted R-squared accounts for the number of predictors in the model, giving a more accurate measure of the model’s explanatory power. The higher Adjusted R-squared for GWRF indicates that it provides a significantly better fit than MLR, even after accounting for model complexity.

10.Conclusion

The GWRF model shows superior performance in predicting resale prices, with lower RMSE and MAE values, and higher R-squared and Adjusted R-squared values. This suggests that GWRF captures more spatial variation and complex relationships in the data compared to the MLR model. The spatial nature of GWRF likely accounts for the improvement, as it can model local variations more effectively than MLR, which assumes a global relationship.